SportsStats Olympic Athletes Analysis

This project analyzes over 120 years of Olympic Games data to uncover fascinating insights about athletic performance, participation trends, and country achievements. Using Python and SQL, I explored athlete demographics, medal distributions, and how physical attributes vary across sports. The analysis reveals how gender participation has evolved, which countries excel in different sports, and what physical characteristics correlate with success. These insights help understand the evolution of the Olympics and the factors that contribute to athletic excellence on the world stage.

Project Overview

Project: Data Cleaning and Analysis
Date: June 2025
Category: Data Analysis
Tools: Python, SQL, Pandas, Matplotlib
Role: Data Analyst

Analysis Notebook

Best Viewed on Larger Screens

For the best experience viewing this analysis notebook, please use a laptop, desktop computer, or tablet in landscape mode.

SportsStats-Project-file.ipynb

import pandas as pd

import matplotlib.pyplot as plt

# Load your Olympic athlete data
df_olympics = pd.read_csv("athlete_events.csv")
noc_df = pd.read_csv("noc_regions.csv")

df_olympics.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 271116 entries, 0 to 271115 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 271116 non-null int64 1 Name 271116 non-null object 2 Sex 271116 non-null object 3 Age 261642 non-null float64 4 Height 210945 non-null float64 5 Weight 208241 non-null float64 6 Team 271116 non-null object 7 NOC 271116 non-null object 8 Games 271116 non-null object 9 Year 271116 non-null int64 10 Season 271116 non-null object 11 City 271116 non-null object 12 Sport 271116 non-null object 13 Event 271116 non-null object 14 Medal 39783 non-null object dtypes: float64(3), int64(2), object(10) memory usage: 20.7+ MB

# 1. Remove rows with null in critical columns
df_olympics = df_olympics.dropna(subset=['Name', 'Sex', 'Age', 'Height', 'Weight', 'NOC', 'Sport', 'Event'])

# 2: Remove duplicate rows
df_olympics = df_olympics.drop_duplicates()

# 3. Convert data types
df_olympics['Age'] = df_olympics['Age'].astype(int)
df_olympics['Height'] = df_olympics['Height'].astype(int)
df_olympics['Weight'] = df_olympics['Weight'].astype(int)

# 4. Join with noc_regions.csv on 'NOC'
df_merged = df_olympics.merge(noc_df, on='NOC', how='left')

# 5. Fill missing region values (if needed)
df_merged['region'] = df_merged['region'].fillna('Unknown')

# Total records
print("Total Records:", df_merged.shape[0])

# Unique countries
print("Total Countries:", df_merged['region'].nunique())

# Top 5 Medal-Winning Countries
print("\nTop Countries by Medal Count:")
print(df_merged[df_merged['Medal'].notna()]['region'].value_counts().head(5))

# Male vs Female participation
print("\nGender Participation:")
print(df_merged['Sex'].value_counts())

Total Records: 206152 Total Countries: 206 Top Countries by Medal Count: region USA 4383 Russia 3610 Germany 3189 Australia 1210 Italy 1060 Name: count, dtype: int64 Gender Participation: Sex M 139441 F 66711 Name: count, dtype: int64

import duckdb

query = """
    SELECT region, COUNT(*) AS total_medals
    FROM 'olympics_cleaned.csv'
    WHERE medal IS NOT NULL
    GROUP BY region
    ORDER BY total_medals DESC
    LIMIT 10
"""

total_medals = duckdb.query(query).df()
print(total_medals)

region total_medals 0 USA 4383 1 Russia 3610 2 Germany 3189 3 Australia 1210 4 Italy 1060 5 Canada 1060 6 UK 1031 7 China 989 8 France 987 9 Japan 843

# Plot Top 10 Countries by Total Medals
plt.figure(figsize=(10,6))
plt.bar(df_medals['region'].head(10), df_medals['total_medals'].head(10), color='steelblue')
plt.title('Top 10 Countries by Total Medals')
plt.ylabel('Total Medals')
plt.xlabel('Country')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

gender_participation = duckdb.query("""
    SELECT 
        year, 
        sex, 
        COUNT(DISTINCT id) AS athlete_count
    FROM 'olympics_cleaned.csv'
    GROUP BY year, sex
    ORDER BY year, sex
""").df()

male_data = gender_participation[gender_participation['Sex'] == 'M']
female_data = gender_participation[gender_participation['Sex'] == 'F']

plt.figure(figsize=(12, 6))
plt.plot(male_data['Year'], male_data['athlete_count'], label='Male', color='blue')
plt.plot(female_data['Year'], female_data['athlete_count'], label='Female', color='red')
plt.title('Male vs Female Athlete Participation Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Athletes')
plt.legend()
plt.show()

query = """
SELECT 
    Sport,
    COUNT(*) AS total_athletes,
    AVG(Age) AS avg_age,
    AVG(Height) AS avg_height,
    AVG(Weight) AS avg_weight
FROM "olympics_cleaned.csv"
WHERE Height IS NOT NULL AND Weight IS NOT NULL AND Age IS NOT NULL
GROUP BY Sport
ORDER BY avg_height DESC
"""

Physical_profiles_by_sport = duckdb.query(query).df()
print(Physical_profiles_by_sport.head(10))

Sport total_athletes avg_age avg_height avg_weight 0 Basketball 3668 25.453926 191.226554 85.801527 1 Volleyball 3277 25.237107 186.978944 78.900214 2 Beach Volleyball 538 29.135688 186.202602 79.089219 3 Water Polo 2719 25.705774 185.036043 84.568224 4 Rowing 7790 25.413607 184.265982 80.168164 5 Handball 3319 26.492618 183.483579 81.503465 6 Baseball 846 26.309693 182.599291 85.717494 7 Tug-Of-War 20 26.500000 182.550000 91.800000 8 Bobsleigh 2205 28.650794 181.608617 89.161451 9 Motorboating 1 27.000000 181.000000 77.000000

query = """
SELECT 
    Sport,
    AVG(Age) AS avg_age
FROM "olympics_cleaned.csv"
WHERE Medal IS NOT NULL
  AND Sport IN ('Rhythmic Gymnastics', 'Equestrianism')
GROUP BY Sport
"""

Avg_Age_by_Sport = duckdb.query(query).df()
print(Avg_Age_by_Sport)

Sport avg_age 0 Rhythmic Gymnastics 18.911290 1 Equestrianism 35.315234

query = """
SELECT 
    region,
    COUNT(DISTINCT ID) AS total_athletes,
    COUNT(Medal) AS total_medals,
    ROUND(COUNT(Medal) * 1.0 / COUNT(DISTINCT ID), 2) AS medal_efficiency
FROM 'olympics_cleaned.csv'
WHERE Medal IS NOT NULL
GROUP BY region
HAVING total_athletes > 0
ORDER BY medal_efficiency DESC
"""

medal_efficiency_per_athletes = duckdb.query(query).df()
print(medal_efficiency_per_athletes.head(10))

region total_athletes total_medals medal_efficiency 0 Namibia 1 4 4.0 1 Kuwait 1 2 2.0 2 Liechtenstein 4 8 2.0 3 Grenada 1 2 2.0 4 Mozambique 1 2 2.0 5 Suriname 1 2 2.0 6 Afghanistan 1 2 2.0 7 Ecuador 1 2 2.0 8 Costa Rica 2 4 2.0 9 Jamaica 81 154 1.9

Project Presentation

Best Viewed on Larger Screens

For the best experience viewing this presentation, please use a laptop, desktop computer, or tablet in landscape mode.

SportsStats Olympic Athletes Analysis Presentation

Project Details

Goal & context. This project analyzes over 120 years of Olympic Games data (1896–2016) to uncover patterns in athlete demographics, medal performance, and participation. The work combines Python (pandas, matplotlib) with SQL (DuckDB) to handle large volume (271K+ rows) and deliver clear insights on countries, sports, gender, and physical attributes—suitable for both narrative reporting and presentation.

Data cleaning & preparation. Raw athlete and NOC/region data were loaded and cleaned in Python: missing values in critical columns (e.g., age, height, weight, NOC, sport) were handled; duplicates were removed and data types were standardized (e.g., integer for age and dimensions). NOC codes were joined to region names for country-level analysis. A clean, analysis-ready dataset was then exported (e.g., to CSV) for use in DuckDB and in visualizations.

Analysis with SQL (DuckDB). DuckDB was used to run efficient SQL queries on the cleaned dataset: medal counts by country, participation by year and sex, physical profiles (e.g., average height/weight/age by sport), and medal efficiency (medals per athlete by country). Aggregations and filters were designed to answer specific questions (top countries, gender trends over time, sport-specific demographics) and to feed both narrative insights and charts.

Visualization & storytelling. Key results were visualized with matplotlib (e.g., bar charts for medal rankings, line charts for participation over time). Charts were labeled and styled for clarity. The analysis was documented in a structured notebook and summarized in a presentation so that findings on Olympic history, gender progress, and country performance could be communicated clearly to a non-technical audience.

Outcome. The project delivers a reproducible Python + SQL workflow, a set of validated insights on Olympic athletes and nations, and presentation-ready visuals that demonstrate the ability to clean, model, query, and visualize a large historical dataset end to end.

Analysis Overview

Data Cleaning & Preparation
Processed raw Olympic athlete data by handling missing values, removing duplicates, converting data types, and merging with regional information for comprehensive analysis.
Medal Distribution Analysis
Analyzed medal counts across countries, identifying USA, Russia, and Germany as the top three medal-winning nations with 4,383, 3,610, and 3,189 total medals respectively.
Gender Participation Trends
Tracked the evolution of male and female athlete participation over 120+ years, revealing significant growth in women's participation especially after the 1980s.
Physical Attribute Analysis
Examined height, weight, and age distributions across different sports, finding that Basketball athletes average 191cm height while Rhythmic Gymnasts have the youngest average age of 18.9 years.
Medal Efficiency Metrics
Calculated medals per athlete ratios to identify efficient sporting nations, revealing that smaller countries like Jamaica achieve exceptional efficiency with 1.9 medals per athlete.

Explore More Projects

Discover other data analysis projects and interactive dashboards

View All Projects